ABSTRACT


Collaborative dialogue is rich in conscious and subconscious 
coordination behaviours between participants. This work 
explores collaborative learner dialogue through theories of 
alignment, analysing inter-partner movement and language 
use with respect to our hypotheses: that they interrelate, 
and that they form predictors of collaboration quality and 
learning. In keeping with theories of alignment, we find 
that linguistic alignment and gestural synchrony both corre- 
late significantly with one another in dialogue. We also find 
strong individual correlations of these metrics with collab- 
oration quality. We find that linguistic and gestural align- 
ment also correlate with learning. Through regression anal- 
ysis, we find that although interconnected, these measures in 
combination are significant predictors of collaborative prob- 
lem solving success. We contribute additional evidence to 
support the theory that alignment takes place across multi- 
ple levels of communication, and provide a methodological 
approach for analysing inter-speaker dynamics in a multimodal
task-based setting. Our work has implications for the
teaching community: our measures can help identify poorly
performing groups, and so can inform the design of
real-time intervention strategies or formative assessment for
collaborative learning.
1. INTRODUCTION


Collaborative problem solving has long been a focus of ed- 
ucational research, and has been deemed an educational 
learning objective of critical importance in the 21st century 
workforce [12]. In the education literature, collaboration 
success is often analysed with respect to a joint problem 
space as created through learner interaction [31]. This joint 
problem space integrates learners’ shared goals, descriptions
of the problem state, awareness of available problem-solving
actions, and the associations between these aspects. This
shared conceptual space is constructed through shared
language, situation and activity.


Alignment in dialogue, the language component of this inter- 
action, is also commonly attributed to an automatic mecha- 
nism to achieve a shared understanding, or situation model 
[27]. This account of dialogue predicts alignment across 
various modes of communication, from word level to ges- 
ture and gaze patterns. Specifically in a task based setting, 
alignment is thought to aid mutual understanding [10, 27]. 
These theories of alignment and collaborative learning when 
taken together suggest that convergence at many levels of 
communication will take place in parallel over the course of 
collaborative dialogue, and that this alignment will be in- 
dicative of collaborative success. Additionally, the collabo- 
rative learning literature suggests that the effort necessary to 
build shared understanding is what actually leads to learn- 
ing [40], thus alignment, already found to be predictive of 
student learning in a teacher student context [42], may also 
be indicative of this. 


Investigating collaborative problem solving through the lens 
of alignment can give additional insights to this complex 
problem of convergence [38]. In this work, we examine the 
synchrony and convergence between students at both a
linguistic and a gestural level, via inter-student metrics of
linguistic alignment and movement synchrony. Of particular
interest in this study is the separate coding of collaboration 
and learning in the learner dialogues. This allows for side by 
side comparison of the different modalities, and the analysis 
of their interaction with respect to these outcomes. Gestural
and linguistic coordination between interlocutors has long
been linked in dialogue, and both properties have been
individually explored for their role in facilitating
collaboration and learning in various settings.


We offer an exploration of theoretically motivated metrics 
to capture synchronisation and alignment at the levels of 
linguistic expressions and movement patterns. We explore 
correlations between the measures themselves and between 
collaboration and learning. Exploring the relationship be- 
tween these measures we find strong correlations between 
modalities, in line with the collaboration and alignment lit- 
erature. Finally we explore the combination of these modal- 
ities in their predictive power for both collaboration and 
learning, finding that although they are interrelated, each 
plays a significant role in prediction quality. 
1.1 Research Questions

Motivated by the hypotheses we draw from the literature on 
collaboration, learning and alignment, we hypothesise that 
more successful learner dyads will converge in both their 
language use and their movement patterns to a visible de- 
gree, as the students align their mental models during the 
learning process. We also hypothesise that together, these 
aspects of interaction can provide useful tools in the analysis 
of student learning. We split our analysis into the following 
research questions: 

RQ1: Evidence of Convergence: Are linguistic alignment,
gestural synchrony and convergence effects between students
higher than would be expected by chance from task and
vocabulary effects alone?

RQ2: Convergence Vs Collaboration & Learning: How
do our measures of convergence correlate with collaboration
and learning?

RQ3: Interaction: Linguistic Vs Gestural: Do these
modalities correlate with one another?

RQ4: Multimodal: Are the measures of synchrony, in
combination, predictive of learning and collaboration
outcomes?


2. BACKGROUND 


Evidence of convergence. Experimentally, in a two per- 
son dialogue setting, speakers have been found to converge, 
each aligning to their interlocutor at many levels of
communication: lexical, structural, gestural and conceptual [27].
Speakers have been found to spontaneously coordinate body
postures and gaze patterns during conversation [35]. Behaviour
matching in multimodal communication has also been found 
to be temporally synchronised in collaborative task-based 
activity, when participants are facing each other [22]. Im- 
itation or mimicry between people unconscious of this be- 
haviour has been found in incidental mannerisms such as 
the bouncing of a foot, or rubbing a nose [9]. People imitate 
one another in dialogue across many different modalities, 
including lexical choice [17], accent [18], pauses [7], speech 
rate [43] and syntax [30, 4, 26]. This imitation has been 
linked to having social benefits, for example, [9] find that 
speakers in those pairs where their incidental mannerisms 
were mimicked perceived the interaction as running more 
smoothly than those whose were not. In terms of collabo- 
ration, multi modal behaviour matching has been found to 
occur in a synchronous manner in a task based collaborative 
dialogue where rapport and its role in learning and conver- 
gence was investigated [22]. Acoustic prosodic entrainment 
has also been found to correlate with rapport, a social qual- 
ity of the interaction, in collaborative learning dialogues [23]. 
Parallel to this, it has been argued that infants’ early skills
of joint attention reflect their emerging understanding that
other people exist as intentional agents [8], as they develop
theory of mind. In terms of learning, lexical entrainment has
been shown to correlate with success in multiparty student 
engineering group project meetings [16], where higher scor- 
ing teams were more likely to increase their entrainment 
in project words over the course of a dialogue, while lower
scoring teams were more likely to diverge. Alignment level
has been shown to vary with student ability [36], and con- 
vergence of lexical and speech features from student to tutor 
in spoken tutorial dialogue corpora has been shown to be a 
useful predictor of learning [42]. 


Language and gesture in learning. Gesture has an im- 
portant role in teaching and learning [32], as does language, 
which, at a structural level, has been shown to exhibit effects 
characteristic of both learning and implicitness, thought of 
as an aspect of alignment or coordination between
interlocutors [15]. A wide range of linguistic features derived from
student dialogues have been found to be effective predic- 
tors of both learning gains and collaboration quality [29]. 
Work categorising gesture in an educational setting often adopts
the framework proposed by [24], which separates gestures into
four basic types: beat (gestures devoid of topical content
which lend temporal or emphatic structure, e.g. hand
tapping, or head movement for emphasis), deictic (concrete or
abstract pointing, e.g. to match an object referred to as ‘this’
or ‘that’, or a concept in the past that is being referred to),
iconic (also referred to as representational, e.g. making a
gesture of putting a phone to one’s ear), and metaphoric
(gestures to illustrate abstract concepts, such as moving hands
together to illustrate mathematical convergence, or drawing
a trend line in the air to demonstrate positive correlation).
While gestures are pervasive, not all types are equally
represented in particular speech events, and their use depends
heavily on the dialogue context. Nevertheless, various studies
have found that gesture and speech together provide a better
index of mental representation than speech alone, and that
gesture is an important aspect of learning [19, 11].


3. METHODS 

Speaker   Utterance

Left:     then we need to turn left . again put an if and then turn the view book .
Right:    so another if do statement ?
Right:    was it just when you write in the sensor here , right into that . i say talk to motor ? all right , a and b .
Left:     if it is that , then we take a left . then turn left .
Right:    turn left , which is right here , right , so
Left:     put this inside that and then again , we need to turn left .
Right:    another if statement ? remember , control . and then talk to motor again . turn left
Left:     turn left comes after .
Right:    and we need one for ...


Table 1: Example dialogue excerpt. Expressions in bold 
indicate shared lexical constructions. 


Experimental Setup. The experimental setup consisted of 
40 pairs of undergraduate students participating in a col- 
laborative problem solving task of programming a robot to 
traverse a maze. The participants had no prior programming
experience. The participants sat facing a computer screen, 
and were recorded as they worked through the shared ex- 
ercises. During the collaborative aspect of the task, the fo- 
cus of our analysis in this work, participant dialogue was 
recorded and subsequently transcribed. Body movement 
data was also recorded via a Microsoft Kinect sensor. This
resulted in timestamped language and movement data for 
the 30 minute period of the task. The participants were 
individually given a pre and post test on a similar set of 
exercises, in order to evaluate the relative learning in the 
dyads. The learning assessment consisted of four short an- 
swer or fill-in-the-blank questions that assessed their under- 
standing of basic computer science competencies (adapted 
from [5, 44]). Learning gains were computed as the difference
between post-test and pre-test scores, divided by the number
of points still available to gain (the total number of points
minus the pre-test score) [13]. Collaboration was evaluated
on a series of axes derived from the
collaboration assessment measures proposed in [25]: sustaining
mutual understanding, dialogue management, information
pooling, reaching consensus, task division, time management,
technical coordination, reciprocal interaction, and
individual task orientation¹. Each dimension was given a
score between +2 and -2 (each score was defined with con- 
crete behaviors in a codebook). Researchers double-coded 
20% of the sessions and had a Cronbach’s alpha of .65 (75% 
agreement). An overall measure of collaboration was defined 
as the aggregate of all collaborative features. More details 
of the study, the data, experimental setup and coding of col- 
laboration and learning scores can be found in Reilly et al. 
2019 [29]. 
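
The learning gain described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name, the argument names and the handling of a perfect pre-test score are assumptions made here.

```python
def normalised_learning_gain(pre, post, max_score):
    """Share of the points still available after the pre-test that were gained.

    Computes (post - pre) / (max_score - pre).
    """
    if not (0 <= pre <= max_score and 0 <= post <= max_score):
        raise ValueError("scores must lie between 0 and max_score")
    if pre == max_score:
        # Assumption: with nothing left to gain the ratio is undefined,
        # so we return None instead of dividing by zero.
        return None
    return (post - pre) / (max_score - pre)


# A student scoring 1 of 4 before and 3 of 4 after gained 2 of the
# 3 points that were still available.
gain = normalised_learning_gain(pre=1, post=3, max_score=4)
```

A negative value indicates that the post-test score fell below the pre-test score.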


Dialogue transcripts. An example snippet from a dialogue 
with high collaboration score is provided in Table 1. The 
transcripts occasionally include some utterances from the 
facilitator, for whom we are not interested in measuring any 
alignment effects. The facilitator typically will speak most 
at the beginning of the dialogue, thus we remove all initial
utterances (including those of participants) in which
facilitator and participant speech interleave. For the remainder
of the dialogue, facilitator utterances are simply removed.
Punctuation, although added at the annotators’ discretion, is
retained, as it provides valuable information about the pace of
the language
used, and indicates the fragmented nature of these dialogues. 
The transcripts were tokenised before analysis using the NLTK
Python package².


Figure 1: Example student movement pattern over time 
(left) and student geometries (right). 


Movement Data. Student movement was processed using the
motion sensors of a Microsoft Kinect, returning a series of
geometries representing the students’ body positions in space.
An example of two students sitting at the table (which
obscures the lower half of their bodies) can be seen in Figure 1.
We only include data from the time period where the stu- 
dents were performing the collaborative activity, which cor- 
responds to the transcript data of 30 minutes per session. 
In terms of pre-processing choices, our interest in the
movement data lies in where the students mimic the gestures of
their partner, independent of their position relative to the
camera or to one another. This allows us to abstract away from
the postural shapes, relative size, and dominant hand of the
participants’ gestural patterns. We thus use averaged 30-second
time slices³ of the movement data (a measure of between-frame
positional difference). We further process this data
to account for the differences in movement which the
experimental setup introduces: we apply standardisation⁴ to
each participant signal so that two signals with different
means and standard deviations can be compared on the
same axis. This grants us a measure of variance similarity,
which better captures the elements of beat gesture patterns
separately from absolute movement differences, as we know
that students consistently display different mean movement
levels across dyads.


¹ Detailed descriptions of these measures can be found
in [25].
² NLTK [21] Python package, http://www.nltk.org
³ Our choice of 30-second slices was in part informed by
previous work [38], and through qualitatively examining the


3.1 Computing Lexical Alignment 

We operationalise linguistic alignment in this work at the 
lexical (word) level, derived from the dialogue transcripts, 
extracting shared expressions, which we define as any
sequence of tokens which contains at least one word (e.g. single
punctuation marks are excluded). The automatic extraction
of shared expressions per dialogue is an instance of the
longest common sub-sequence problem [20, 3]. For each dia- 
logue, we extract the inventories of shared expressions using 
the method proposed by Duplessis et al. [14]. For each of 
the two dialogue-specific inventories of shared constructions, 
we compute the following measures: 


• Expression Variety (EV): The lexical diversity of the
expression vocabulary.


• Expression Repetition (ER): The ratio of produced
tokens belonging to an instance of an established
expression.


• Vocabulary Overlap (VO): Captures the richness of the
shared vocabulary, as the ratio of vocabulary shared
between participants to the total vocabulary used in
the dialogue:


\[ \mathrm{VO} = \frac{\#(\mathit{words}_{\text{speaker1}} \cap \mathit{words}_{\text{speaker2}})}{\#(\mathit{words}_{\text{speaker1}} \cup \mathit{words}_{\text{speaker2}})} \]


Individuals repeat and introduce expressions at different rates 
within dyads, thus we additionally calculate dyad level mea- 
sures to capture the symmetry between interlocutors. 


• Expression Initiator (IE) Difference: Difference in the
percentage of shared constructions introduced by each dialogue
participant. The initiator is the dialogue participant
who first uses a subsequently shared and repeated
construction.


\[ \lVert \mathrm{IE}(\text{speaker one}) - \mathrm{IE}(\text{speaker two}) \rVert \]


• Expression Repetition Difference: The difference in
the proportion of each speaker’s utterances which
contain a usage of a shared construction:


\[ \lVert \mathrm{ER}(\text{speaker one}) - \mathrm{ER}(\text{speaker two}) \rVert \]


These measures capture the between-speaker repetition within a
dialogue, which we use as a proxy for the coordination or
alignment between the speakers. An example of
expression repetition can be found in Table 1.
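
To make the vocabulary overlap and initiator measures concrete, the sketch below computes them on a few turns adapted from Table 1. It is a deliberately simplified stand-in and not the extraction method of [14]: shared expressions are approximated as token n-grams produced by both speakers, the regular-expression tokeniser replaces NLTK, and the names `tokenize`, `shared_expressions`, `initiator_difference` and the cut-off `max_n` are all hypothetical.

```python
import re


def tokenize(utterance):
    # Stand-in tokeniser (assumption): words, or single punctuation marks.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", utterance.lower())


def has_word(tokens):
    return any(ch.isalnum() for tok in tokens for ch in tok)


def vocabulary_overlap(turns):
    """Jaccard overlap of the two speakers' word sets (the VO measure)."""
    vocab = {}
    for speaker, utterance in turns:
        words = {tok for tok in tokenize(utterance) if has_word([tok])}
        vocab.setdefault(speaker, set()).update(words)
    first, second = vocab.values()  # assumes exactly two speakers
    return len(first & second) / len(first | second)


def shared_expressions(turns, max_n=4):
    """Map each n-gram used by both speakers to whoever used it first."""
    first_user, users = {}, {}
    for speaker, utterance in turns:
        tokens = tokenize(utterance)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if not has_word(gram):
                    continue  # punctuation-only sequences are excluded
                first_user.setdefault(gram, speaker)
                users.setdefault(gram, set()).add(speaker)
    return {g: first_user[g] for g, who in users.items() if len(who) == 2}


def initiator_difference(turns):
    """Absolute difference in the share of shared expressions initiated."""
    shared = shared_expressions(turns)
    if not shared:
        return 0.0
    speakers = sorted({speaker for speaker, _ in turns})
    shares = [sum(1 for who in shared.values() if who == s) / len(shared)
              for s in speakers]
    return abs(shares[0] - shares[1])


turns = [
    ("Left", "then we need to turn left ."),
    ("Right", "so another if statement ?"),
    ("Left", "then turn left ."),
    ("Right", "turn left , which is right here"),
]
vo = vocabulary_overlap(turns)
shared = shared_expressions(turns)
ie_diff = initiator_difference(turns)
```

In this toy excerpt every shared expression is introduced by the left-hand speaker, so the initiator difference takes its maximum value of 1.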


data at 5-second and 30-second slices: 30 seconds seems to
be sufficient to capture interesting aspects of finer-grained
hand movement, but not so fine as to render average total
movement data meaningless.

⁴ Standardising a time series dataset involves re-scaling the
distribution of values so that the mean is 0 and the standard
deviation is 1, also known as Z-normalisation.
3.2 Computing Gestural Synchrony

We use Dynamic Time Warping (DTW) [33] as the measure of
similarity between partner movement patterns, as it has been
found to be a consistently robust measure of time series
similarity [2]. Although introduced as a method for
analysing speech signals [33], it has been employed successfully
in the field of gesture recognition [1]. DTW is a technique to
find the optimal alignment between two time series, through
the stretching or compression of either series along its time
axis. This warping can be used to find corresponding
regions between two time series, and to serve as a distance
measure. This measure of distance allows slightly
out-of-sync movement patterns, or those of slightly different
phase length, to be deemed more similar than their
difference in slope would suggest. Our main motivation for
choosing DTW over other metrics demonstrated to be suitable
for this task, such as [38, 28], is our hope to capture the
similarity of slightly asynchronous movement which, if movement
and linguistic alignment are indeed linked, should follow a
similar mixed turn-taking behaviour to dialogue [41]. As well as
providing a robust distance measure between two time
series, DTW returns a warping path which describes in which
direction each time series needs to be moved in order to best
align: in other words, the warping path can provide useful
information about leader and follower dynamics, which we
exploit in our analysis.
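
As an illustration of the two steps just described, the following sketch standardises a signal and then computes a DTW distance together with its warping path. It assumes the textbook dynamic-programming formulation with an absolute-difference local cost and no window constraint; the DTW variant and implementation actually used in the study are not specified here, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_normalise(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # a flat signal carries no shape
    return [(x - mu) / sigma for x in series]


def dtw(a, b):
    """Return (distance, path) for the classic O(len(a) * len(b)) DTW."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a waits for b
                                     cost[i][j - 1],      # b waits for a
                                     cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which samples were aligned.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# The second signal repeats the first with a one-slice lag.
leader = [0.0, 1.0, 2.0, 1.0, 0.0]
follower = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
lag_distance, lag_path = dtw(leader, follower)
standardised = z_normalise([1.0, 2.0, 3.0])
```

Because warping absorbs the lag, the two example signals receive a distance of zero even though they are offset in time, which is the property that motivates the choice of DTW above.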


Inspired by common measures of gestural synchrony and body
language, we investigate the following dyadic behavioural
properties:


• Movement Difference (Mdiff): Global mean movement
difference in a pair, calculated as the absolute value of
the difference between the means of the two participants.


• Movement Synchrony (dtw_dist): The synchrony
between pairs in terms of their movement patterns, as
measured by the Dynamic Time Warping (DTW) [33]
distance.


• Leader Follower dynamics (diffLF): The directionality
of the alignment of the movement between the pair.
This metric is derived from the DTW path.


Additionally, to measure whether these similarities become 
more pronounced as the session progresses, we divide the ses- 
sion in half by timestamp, and compute the measures per 
half. We use the difference between dialogue halves (second 
- first) as a measure to capture convergence. This results 
in three additional measures corresponding to the
synchrony measures above: Mdiff_change, dtw_dist_change and
diffLF_change. The aspects of the body geometries which 
we focus on consist of the points for Head, Hands, Shoul- 
ders and Total (average) movement. For hand and shoulder 
measures, these are defined as an average of the movement 
in the right and left points for each aspect. 
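
The half-session comparison can be sketched as follows for the movement difference measure. This is an illustrative reading of the description above: splitting at the temporal midpoint of the observed timestamps, and the function names, are assumptions.

```python
from statistics import mean


def movement_difference(left, right):
    """Mdiff: absolute difference between the two participants' means."""
    return abs(mean(left) - mean(right))


def split_by_time(samples, midpoint):
    first = [value for t, value in samples if t < midpoint]
    second = [value for t, value in samples if t >= midpoint]
    return first, second


def change_between_halves(left, right, measure):
    """Return measure(second half) - measure(first half).

    `left` and `right` are lists of (timestamp_seconds, value) pairs.
    """
    times = [t for t, _ in left + right]
    midpoint = (min(times) + max(times)) / 2
    left_first, left_second = split_by_time(left, midpoint)
    right_first, right_second = split_by_time(right, midpoint)
    return (measure(left_second, right_second)
            - measure(left_first, right_first))


# Toy signals: the pair start far apart and move closer together.
left = [(0, 1.0), (30, 1.0), (60, 2.0), (90, 2.0)]
right = [(0, 3.0), (30, 3.0), (60, 2.5), (90, 2.5)]
mdiff_change = change_between_halves(left, right, movement_difference)
```

A negative change means the pair became more similar in the second half, which is the direction read as convergence; any dyadic measure with the same two-signal signature can be passed in place of `movement_difference`.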


3.3 Measure Validation - Baseline 

A certain level of similarity between speakers will exist
independently of their adapting to one another. Because they
perform the same task, vocabulary will necessarily be
constrained by topic, and consistent across pairings. Because of
the experiment configuration, task-specific gesture patterns
such as moving the robot or interacting with the computer,
as well as the side on which their interlocutor sits, will also
lead to movement similarities, e.g. turning to the right vs
the left to speak. We thus create baselines for both dialogue
and movement data which demonstrate the levels of similarity
inherent to the task setup. For the dialogue baseline, we
create a scrambled version of the corpus by retaining the
utterances of one of the students and interleaving them with
utterances randomly drawn from another pair, per speaker. For
a partner-specific movement baseline, the movement data
from each student is randomly paired with the data from
another student on the same side relative to them as their
partner was (i.e. for each participant on the right-hand side,
we replace their partner with a participant from the left-hand
side of another dyad). To further check task-specific effects of
the seating configuration, we pair students sitting in the same
position with one another, in order to confirm that the role
does not show more similarities than the original student
pairings.
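
One way to build the partner substitution baseline is sketched below. It is an assumption-laden illustration: the function name is hypothetical, each dyad is represented as a `(left_signal, right_signal)` tuple, and the random re-pairing is drawn by reshuffling until no student keeps their real partner.

```python
import random


def partner_substitution_baseline(dyads, seed=0):
    """Re-pair every right-hand student with a left-hand student
    taken from a different dyad."""
    if len(dyads) < 2:
        raise ValueError("need at least two dyads to swap partners")
    rng = random.Random(seed)
    order = list(range(len(dyads)))
    # Reshuffle until nobody is matched with their own partner.
    while any(i == j for i, j in enumerate(order)):
        rng.shuffle(order)
    return [(dyads[j][0], dyads[i][1]) for i, j in enumerate(order)]


# Labels stand in for movement signals so the pairing is easy to read.
dyads = [(f"L{k}", f"R{k}") for k in range(5)]
baseline = partner_substitution_baseline(dyads)
```

Each baseline pair keeps the original left/right seating, so any similarity that remains reflects the task and seating configuration and not adaptation between partners.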


4. RESULTS 
4.1 Analysis 1: Measuring Convergence 


Linguistic. We first hypothesise that there will be significant
repetition between dyad partners beyond what the task demands
by chance, since alignment has been linked to both learning and
collaboration. The same measures have found significant
alignment levels in negotiation [14], as well as in second
language tutoring dialogue [37], although our dialogue
setting differs in that both speakers are learners. To
explore whether alignment is greater than chance, we
compare the original dialogues to the shuffled baseline
in the same manner as [14]. The expression variety is
significantly higher for the original (mean=0.118, std=0.023)
than for the shuffled dialogues (mean=0.110, std=0.015).
Statistical difference is checked by a Wilcoxon rank sum test
(U = 1141, p = 0.03 < 0.05, r = 0.21)⁵. This indicates that
there exists a richer and more dyad-specific expression
lexicon. The expression repetition is also significantly higher
for the original (mean=0.509, std=0.123) than for the shuffled
dialogues (mean=0.487, std=0.109) (U = 1079.5, p = 0.014
< 0.05, r = 0.25). This means that the level of repetition
between the students in a dyad is not simply incidental, and can
be attributed to alignment or routinisation effects. Finally, as
a measure of how task-specific the vocabulary is, we find the
vocabulary overlap between speakers significantly higher in
the original (mean=0.509, std=0.123) than in the shuffled
dialogues (mean=0.487, std=0.109) (U = 856, p = 0.0002
< 0.001, r = 0.41). This difference demonstrates that
students share a much richer vocabulary than would happen
by chance in performing this task. Overall, these results
show that the collaborative student dialogues contain a
richer expression lexicon than they would by chance,
indicating that the students align to one another, resulting in
their language converging [10, 27].


Gestural. We hypothesise that our measures of movement
matching will result in higher partner-specific synchrony than
in our baseline: i.e. lower distance between pairs collaborating
than between any student and a different partner performing
the same task. We find that within-dyad DTW
similarity is significantly higher than in both the partner
substitution baseline (t = -2.0401, p < 0.05), and the
within-side baseline (t = -2.0397, p < 0.05). This indicates
that DTW is a useful measure of movement similarity in this
setting: it captures partner-specific effects in these movement
patterns, and allows us to compare similarities between dyads.


⁵ Following [14], for each test, we report the test statistic
(U/W), the p-value (p) and the effect size (r).


4.2 Analysis 2: Convergence Correlated with 
Collaboration and Learning 


Linguistic alignment. We hypothesise that alignment between
students will correlate with learning, since in other
tutoring settings it has been found to correlate with both
learning gains [42] and linguistic ability [36]. Additionally,
global language features of collaborative dialogue have also
been found to correlate with learning gains [29]. To answer RQ2,
we compare our linguistic alignment metrics with learning
and collaboration scores. We find support for a correlation
between collaboration and alignment:
ER (r = 0.680, p < 0.001), EV (r = 0.622, p < 0.001), and VO
(r = 0.663, p < 0.001) all correlate with collaboration using
Pearson’s r correlation coefficient. This shows that
inter-partner repetition is important, and that students will
converge even to less common language in a collaborative
setting. We also find our between-learner measures
significantly correlated with collaboration, IE_diff (r = -0.570,
p < 0.01) and ER_diff (r = -0.54, p < 0.001), meaning that
smaller differences between learners in the initiation and
repetition of shared expressions correlate with how well they
collaborate. EV (r = 0.442, p = 0.026), ER (r = 0.442, p = 0.006)
and VO (r = 0.349, p = 0.034) also correlate with learning,
although to a lesser degree. One intuition as to why is that
other studies reporting a correlation between alignment and
learning analyse dialogues conducted in an asymmetric tutoring
setting, where adopting the language of the teacher is a
sensible learning strategy, as it is assumed that this language
is correct. In our case, since these dialogues are between peer
learners, the learning outcome is somewhat dependent on
the rapport within the dyad, and on the information aligned to
being correct. In other words, in some cases the learners
may be converging to a shared mental representation, but it
may not be the correct one. In keeping with this observation
about dyad rapport and equality, IE_diff (r = -0.487, p = 0.002)
and ER_diff (r = -0.515, p = 0.001) both show strongly that
more equal contributions from the students, in terms of
repeating one another and of introducing words upon which
to align, correlate with learning.
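
The correlations reported in this analysis are Pearson coefficients. As a reminder of what is being computed, here is a minimal plain-Python sketch; the function names are hypothetical and the toy numbers are not study data.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Toy example: an alignment score and an outcome score for five dyads.
alignment = [1, 2, 3, 4, 5]
outcome = [2, 1, 4, 3, 5]
r = pearson_r(alignment, outcome)
t = t_statistic(r, len(alignment))
```

The p-values quoted in the text correspond to comparing such a t value against Student's t distribution with n - 2 degrees of freedom.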


Movement Synchrony and Convergence. We hypothe- 
sise that movement synchrony and convergence, as defined 
by DTW distance and its change over the interaction, will 
provide a robust measure of synchrony which will better dis- 
tinguish between dyads with differing activity levels, which 
in turn should correlate with collaboration and learning, in 
keeping with previous results with other measures in task 
based dialogue [22, 32, 28]. We compare our movement sim- 
ilarity metric with learning and collaboration scores. Overall 


[Figure 2 panels: Movement Synchrony; Movement Change. Rows:
dtw_dist, diffLF and Mdiff for Head, Hands, Shoulders and
Mtotal. Columns: Collaboration, Learning.]


Figure 2: Gestural synchrony and convergence vs. Collaboration
and Learning measures correlation with Pearson’s r.
Significant p values reported in the text.


as can be seen from Figure 2, we find average movement
synchrony to correlate with both collaboration and learning. In
terms of learning, dtw_dist measures for Head (r = -0.611,
p < 0.001), Hands (r = -0.561, p = 0.002), Shoulders (r
= -0.609, p < 0.001), and Mtotal (r = -0.611, p = 0.006)
all significantly correlate with learning to a strong degree.
Convergence between dyads in terms of Mtotal (r = -0.519, p
= 0.006) and Hands (r = -0.467, p = 0.014) also significantly
correlates with learning. Finally, dyads becoming more
dissimilar in terms of hand movement (having a stronger leader
follower dynamic) also significantly correlates with learning:
Hands_diffLF (r = 0.471, p = 0.013). With collaboration,
Head, Hands, Shoulders and Mtotal all significantly (p <
0.05) correlate, as does the diffLF for Hands (r = -0.42, p =
0.029) and Mtotal (r = -0.519, p = 0.006). Overall, the results
are intuitive: we find that more synchronous pairs, as measured
by DTW distance, significantly correlate with collaboration
quality. We also find that convergence between dyads
is present (negative correlations for dtw_dist change and
Mdiff change show greater similarity between learners over
time) and correlates with learning quite strongly for some
movement metrics. We also see a positive correlation
for the diffLF change features, particularly with learning,
indicating that while convergence of behaviour is important,
some aspects of turn taking and initiative are separate from
this.


4.3 Analysis 3: Comparing Linguistic and
Gestural Convergence

Comparing Linguistic and Gestural convergence, we hypoth- 
esise these aspects of communication will correlate with one 
another, as previous literature suggests [17, 4, 22]. To an- 
swer research question (RQ3), we contrast the modalities 
themselves. We split this comparison to compare gestural 
and linguistic coordination. We hypothesise that movement 
synchrony and linguistic alignment will correlate strongly, 
due to the process of speakers’ alignment of shared men- 
tal representations taking place across various linguistic and 
paralinguistic levels [6, 27]: if dyads align at the lexical level, 
it is likely that the same process leading to this alignment 
will affect the gestural level also [27]. The DTW path al- 
lows us to capture the relationship between slightly offset 
movement patterns of beat gesture mimicry [24], and the 
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Alignment Vs Movement Synchrony — Alignment Vs Movement Change 
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Figure 3: Movement measures vs. linguistic alignment mea- 
sures. Pearson’s r correlation coefficient. 


case where linguistic alignment in turn taking utterances is 
low, which can lead to more synchronised patterns of move- 
ment [32]. Previous work has also found evidence of the co- 
ordination of of lexical alignment and gestural behaviour in a 
multimodal [35, 9, 22] context, we thus hypothesise that col- 
laborative problem solving dialogues will also demonstrate 
this. We find significant correlation between linguistic align- 
ment measures and movement synchrony across all move- 
ment patterns (Figure 3), with strongest effects for head (r 
= -0.735, p-value: 1.225) and hands (r = -0.706 p-value: 
3.811), providing support that our hypothesis about lexical 
patterns influencing gestural alignment at the level of beat 
gesture and head nodding may be be true for this setting. 
In terms of convergence, there is a significant correlation
between change in hand movement and expression variety
(r = -0.387, p-value: 0.0458). Interestingly, the difference
between speakers’ linguistic patterns (Diff_IE, Diff_ER)
positively correlates with both their difference in movement
synchrony and speaker divergence, indicating that asymmetric
relationships between students are visible across modalities
of communication. In keeping with our hypothesis, we find
strong negative correlation of between-speaker difference in
movement and divergence (equivalently, strong correlation
between similarity and convergence) with the linguistic
measures of convergence, providing supporting evidence for the
hypothesis that linguistic and gestural convergence are part
of the same underlying communicative process.
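
As a rough illustration of the kind of computation discussed above, the sketch below computes a DTW distance between two speakers’ movement series for each dyad and then correlates that distance with a linguistic alignment score using Pearson’s r. It is not the pipeline used in this paper: the dyad values, the field names (`alignment`, `speaker_a`, `speaker_b`) and the absolute-difference step cost are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series (absolute-difference cost)."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Invented toy dyads: one alignment score and two movement series per dyad.
dyads = [
    {"alignment": 0.9, "speaker_a": [0, 1, 2, 1, 0], "speaker_b": [0, 1, 2, 1, 0]},
    {"alignment": 0.6, "speaker_a": [0, 1, 2, 1, 0], "speaker_b": [0, 2, 3, 1, 0]},
    {"alignment": 0.3, "speaker_a": [0, 1, 2, 1, 0], "speaker_b": [2, 0, 3, 0, 2]},
    {"alignment": 0.1, "speaker_a": [0, 1, 2, 1, 0], "speaker_b": [3, 3, 0, 3, 3]},
]

alignment = [d["alignment"] for d in dyads]
distances = [dtw_distance(d["speaker_a"], d["speaker_b"]) for d in dyads]
r = pearson_r(alignment, distances)
print(distances, round(r, 3))
```

Because a lower DTW distance means more similar movement, a negative r in this toy data corresponds to higher alignment going with higher synchrony. A p-value would additionally require a t-distribution, for example via `scipy.stats.pearsonr`.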


4.4 Analysis 4: Predicting Learning and Collaboration

Finally, in order to answer research question (RQ4), to find
combined interaction effects of the various inter-modality
measures, we fit a series of mixed effect regression models
(to fit the data and perform the statistical tests within this
paper, we use the Statsmodels Python package [34]). We
hypothesise that while each measure individually is strong,
and although the measures themselves are correlated, each
modality will provide its own distinct information
contributing to learning and collaboration aspects. We
perform backward stepwise model selection to select the
best predictors, first fitting each model with all relevant
variables and stopping only when all remaining terms have
significance p < 0.05. Although the RMSE and r² values of
predicted data are best when combining all factors, we
wished to discover the minimal significant descriptive set
of criteria in order to find more interaction in our results.


Table 2: Mixed effects regression model multimodal results

Learning (RMSE: 5.48, r²: 0.90)
Formula: Learning ~ EV:ER + Diff_IE * Diff_IE
    + Handmean_movement_30_diffLFChange
    + Shouldermean_movement_30_diffLFChange

Collaboration (RMSE: 0.15, r²: 0.999)
Formula: Collaboration ~ EV * ER + Diff_IE * Diff_ER
    + Head_movement_30_dtw_dist
    + Handmean_movement_30_dtw_dist
    + Shouldermean_movement_30_dtw_dist
    + movement_total_30_dtw_dist
    + Head_movement_30_diffLFChange
    + Shouldermean_movement_30_diffLFChange
    + movement_total_30_diffLFChange
    + Head_movement_30_dtw_dist_change
    + Handmean_movement_30_dtw_dist_change
    + Shouldermean_movement_30_dtw_dist_change
    + movement_total_30_dtw_dist_change
    + Head_movement_30_diffLF
    + Handmean_movement_30_diffLF
    + Shouldermean_movement_30_diffLF
    + movement_total_30_diffLF
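
The backward stepwise procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the models fitted in this paper: it swaps in ordinary least squares (`statsmodels.formula.api.ols`) for the mixed effect models, uses synthetic data, and the column names (`ER`, `head_dtw_dist`, `unrelated`, `collaboration`) are hypothetical stand-ins chosen for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def backward_select(data, response, predictors, alpha=0.05):
    """Refit, dropping the least significant predictor, until all p-values are below alpha."""
    remaining = list(predictors)
    while remaining:
        formula = f"{response} ~ " + " + ".join(remaining)
        fit = smf.ols(formula, data=data).fit()
        pvalues = fit.pvalues.drop("Intercept")  # only test the predictors
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return fit, remaining
        remaining.remove(worst)
    return None, remaining


# Synthetic data: the response depends on two predictors and not on the third.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "ER": rng.normal(size=n),
    "head_dtw_dist": rng.normal(size=n),
    "unrelated": rng.normal(size=n),
})
data["collaboration"] = (
    1.5 * data["ER"] - 2.0 * data["head_dtw_dist"] + rng.normal(scale=0.5, size=n)
)

fit, selected = backward_select(
    data, "collaboration", ["ER", "head_dtw_dist", "unrelated"]
)
rmse = float(np.sqrt(np.mean(fit.resid ** 2)))  # in-sample RMSE of the final model
print(selected, round(rmse, 3), round(fit.rsquared, 3))
```

Interaction terms such as `EV * ER` would need the loop to work on formula terms rather than column names. A mixed effect variant could replace `smf.ols` with `smf.mixedlm` and a grouping column, at the cost of slower and less stable fits on small samples.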


Table 2 shows the minimal significant set of linguistic and
gestural factors and their interactions in terms of their
ability to predict the dependent variables of learning and
collaboration. Each modality separately can form a good
predictor of both collaboration and learning in this setting.
However, this analysis offers strong support for the
multimodal modelling of collaborative problem solving,
indicating that although they correlate with one another,
both linguistic and gestural aspects have an independent role
to play when predicting learning and collaboration. Broadly,
from Table 2, the gestural features chosen indicate that both
the measures of synchrony and those for convergence (_change
features) play a role in prediction. It is also clear that
predicting collaboration in this case is easier than
predicting learning (r² of 0.999 versus 0.90). This may be
influenced by ceiling effects, or by ease of the pre-test
being a limiting factor: for example, a learner with a very
good pre-test score has little room left to improve by the
end of the session.


5. DISCUSSION & CONCLUSION 


We find significant levels of both linguistic and movement
synchrony in our data (RQ1). In answer to RQ2, we find that
our measures of linguistic and gestural alignment correlate
with collaboration. In terms of learning, we find that the
difference in repetition between students negatively
correlates with learning, and that movement synchrony in
general shows strong correlation with learning. In terms of
RQ3, we find significant strong to medium effects when
correlating measures of ER and dtw_dist with one another.
This contributes to a growing body of evidence in support of
theories of interactive alignment emerging across
communicative modalities. Finally, via regression analysis
combining our metrics (RQ4), we find that although each
modality is powerful separately, a combination of modalities
best explains collaboration and learning outcomes. Our
findings show the importance of analysing between-speaker
dynamics to capture nuances of learning, and suggest the use
of a multimodal approach for the best understanding of these
interactions. We also contribute new evidence adding to work
exploring the relationship between linguistic alignment and
gestural and movement similarity. Our findings, while limited
to a small specific setting, contribute evidence to support
existing theories of human cognition and alignment.
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